Introduction to Data Science with R

Session 5a: Importing data

Ina Bornkessel-Schlesewsky

November 29, 2023

Working with data from a file


  • Typically, the data we work with aren’t available as R packages
  • Rather, we need to read them in from a file
  • csv is a popular format
    • comma separated values
    • plain text
    • can be exported from all popular spreadsheet applications (e.g. Excel)
    • bonus: accessible without any proprietary software, works well with version control, etc. (more on this later)
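
To make the "plain text" point concrete, here is a minimal sketch that writes a tiny csv file by hand and reads it back in. The file contents and the column names (item, value) are made up for illustration; only read_csv() is taken from these slides.

```r
library(readr)

# A made-up data set written as plain text:
# the first line holds the column names, each later line is one row
csv_lines <- c(
  "item,value",
  "a,1.5",
  "b,2.5",
  "c,4"
)

# Write the lines to a temporary file (hypothetical example file)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# Read the file back in as a tibble; column types are guessed from the text
toy <- read_csv(tmp_file, show_col_types = FALSE)
toy
```

Because the file is just text, it can be opened in any editor, which is also what makes it easy to track with version control.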

Example

Student-to-teacher ratios in different parts of the world:

Data set

# read_csv() and glimpse() are provided by the tidyverse
library(tidyverse)
st_ratios <- read_csv("student_teacher_ratios.csv")
glimpse(st_ratios)
Rows: 180
Columns: 20
$ indicator            <chr> "Primary Education", "Primary Education", "Primar…
$ country              <chr> "Afghanistan", "Albania", "Algeria", "Angola", "A…
$ country_code         <chr> "AFG", "ALB", "DZA", "AGO", "ATG", "ARG", "ARM", …
$ edulit_ind           <chr> "PTRHC_1", "PTRHC_1", "PTRHC_1", "PTRHC_1", "PTRH…
$ year                 <dbl> 2017, 2017, 2017, 2015, 2017, NA, NA, 2017, 2017,…
$ student_ratio        <dbl> 44.00995, 17.94478, 24.22505, 50.02951, 12.05576,…
$ flag_codes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ flags                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ name                 <chr> "Afghanistan", "Albania", "Algeria", "Angola", "A…
$ alpha.2              <chr> "AF", "AL", "DZ", "AO", "AG", "AR", "AM", "AT", "…
$ alpha.3              <chr> "AFG", "ALB", "DZA", "AGO", "ATG", "ARG", "ARM", …
$ country.code         <chr> "004", "008", "012", "024", "028", "032", "051", …
$ iso_3166.2           <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:DZ"…
$ region               <chr> "Asia", "Europe", "Africa", "Africa", "North Amer…
$ sub.region           <chr> "Southern Asia", "Southern Europe", "Northern Afr…
$ region.code          <chr> "142", "150", "002", "002", "019", "019", "142", …
$ sub.region.code      <chr> "034", "039", "015", "017", "029", "005", "145", …
$ x                    <dbl> 22, 15, 13, 13, 7, 6, 20, 15, 21, 4, 20, 23, 8, 1…
$ y                    <dbl> 8, 9, 11, 17, 4, 14, 6, 6, 7, 2, 9, 8, 6, 4, 5, 3…
$ student_ratio_region <dbl> 19.64278, 13.01069, 36.38758, 36.38758, 16.18269,…

To work with this data set

  • Download the file student_teacher_ratios.csv from the course website
  • Move the file to a suitable directory on your computer
  • Create a new RStudio project (see next slide)

A project-based workflow

  • RStudio projects are a great way to keep all associated components of a project in the one place:
    • Analysis code (R scripts / Quarto documents)
    • Data
    • Results
  • This further enhances the reproducibility of your analysis

Creating an RStudio project

  • File > New Project
  • choose “Existing directory” – the one that you saved the .csv file to
  • (as you can see, there are also other options)

  • Check RStudio settings

Suggested reading

R for Data Science - Chapter 9 provides further details on and motivation for a project-based workflow.

Reading in the file

  • open your RStudio project by double clicking on the .Rproj file
  • this will ensure that R’s “working directory” is set correctly
  • make sure that your Quarto file and the csv file are both in the top-level project directory
  • use the read_csv() function from the {readr} package (part of the tidyverse, so load it first with library(tidyverse))
  • assign the result to a new object so that you can work with the data afterwards
library(tidyverse)
st_ratios <- read_csv("student_teacher_ratios.csv")
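
If read_csv() reports that the file does not exist, the most likely cause is that R's working directory is not the project directory. The following sketch uses only base R functions to check this; the object name csv_name is just a convenience introduced here.

```r
# Where is R currently looking for files?
getwd()

# Is the data file visible from there? (TRUE if it is)
csv_name <- "student_teacher_ratios.csv"
file.exists(csv_name)

# List all csv files in the working directory
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, close RStudio, reopen the project via the .Rproj file and check that the csv file sits in the top-level project directory.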

Brief exploration of student-teacher ratios

Joint exploration

  1. Which region has the highest variability in student-teacher ratios? Visualise this with an appropriate figure. Note the use of flipped x/y axes as well as a custom colour palette and theme in the example.
st_ratios |> 
  ggplot(aes(x = region, 
             y = student_ratio,
             fill = region)) +
  geom_boxplot() +
  # use a custom colour scale
  scale_fill_brewer(palette = "Dark2") +
  # use a custom theme
  theme_bw() +
  # flip x and y axes for a horizontal boxplot
  coord_flip() +
  labs(
    title = "Student-teacher ratio per region",
    x = "Region",
    y = "Student-teacher ratio",
    fill = "Region"
  )

Alternative version

# theme_cowplot() comes from the {cowplot} package,
# which must be installed and loaded
library(cowplot)

st_ratios |> 
  ggplot(aes(x = region, 
             y = student_ratio)) +
  # use a single custom colour to fill all boxes
  # (a fixed value, not an aesthetic mapping)
  geom_boxplot(fill = "steelblue") +
  # use another custom theme
  theme_cowplot() +
  # flip x and y axes for a horizontal boxplot
  coord_flip() +
  labs(
    title = "Student-teacher ratio per region",
    x = "",
    y = "Student-teacher ratio"
  )
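
The boxplots show variability visually; a numeric summary can back up what you see. The sketch below is one possible approach, not part of the original slides: it computes the standard deviation and interquartile range of student_ratio per region. The helper name summarise_spread() and the toy data are made up for illustration; the column names region and student_ratio match those in st_ratios.

```r
library(dplyr)

# Summarise the spread of student_ratio within each region
# (summarise_spread is a hypothetical helper, not a package function)
summarise_spread <- function(data) {
  data |>
    # drop rows without a ratio so sd() and IQR() do not return NA
    filter(!is.na(student_ratio)) |>
    group_by(region) |>
    summarise(
      n = n(),
      sd = sd(student_ratio),
      iqr = IQR(student_ratio),
      .groups = "drop"
    ) |>
    # largest spread first
    arrange(desc(iqr))
}

# Made-up toy data with the same column names as st_ratios
toy <- tibble(
  region = c("A", "A", "A", "B", "B", "B"),
  student_ratio = c(10, 20, 30, 14, 15, 16)
)

summarise_spread(toy)
```

With the real data loaded, summarise_spread(st_ratios) should give one row per region, which you can compare against the boxplots.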

A few questions to explore

  2. Focus on the region that you identified in question 1.
    • Create a new dataframe just for this region.
    • Which country has the lowest ST-ratio in this region and which has the highest? What are their ratios?
    • Isolate the countries with an ST-ratio higher than the median and plot the ST-ratios for these using an ordered horizontal column graph (recall from the week 4 exercises how to change the order). Use an appealing colour for the columns and pick a theme for the plot that you like.

(If you would like extra practice: try doing the same as in 2 for the region with the lowest variability.)
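
As a reminder of the building blocks involved, here is a minimal sketch on made-up toy data (the items and values are hypothetical and have nothing to do with the real data set). It uses reorder() for the ordering, which may differ from the approach used in the week 4 exercises.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data (hypothetical items and values)
toy <- tibble(
  item = c("a", "b", "c", "d", "e"),
  value = c(12, 30, 18, 25, 9)
)

# Rows with the lowest and the highest value
toy |> slice_min(value, n = 1)
toy |> slice_max(value, n = 1)

# Keep only the rows above the median value
above_median <- toy |>
  filter(value > median(value))

# reorder() sorts the items by value so the columns appear in order;
# coord_flip() turns the columns horizontal
above_median |>
  ggplot(aes(x = reorder(item, value), y = value)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(x = NULL, y = "Value")
```

The same pattern should carry over to your region dataframe once you swap in the relevant column names.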

Resources

Week 5 exercises

Go to the RStudio 2021 conference website and watch one of the talks in the Data for Good section. (They’re all quite short – between 10 and 20 minutes in length.) After watching the talk, respond to the following questions and upload your answers in a single document.

  1. Which talk did you choose to watch and why?

  2. What are three key points / insights that you took away from the talk?

  3. Has watching this talk changed your perspective on data science in any way? If yes, how?